October 5, 2015

Setting Expectations

Expect to:

  1. Get a sense of what is possible with R.
  2. Set up important frameworks around how to do data science (forewarning - I'm biased!)
  3. Be warned of potential minefields

Do Not Expect to:

  1. Completely understand all the code presented on your first try.
  2. Be exposed to all of the features of the mentioned packages.

The Data Scientist's Workflow

R Solutions

Case Study Overview

Goal: Reproduce the final pharmacogenomic regression model and create an interactive web dose calculator.


This study used clinical and genetic data from a broad population to estimate appropriate warfarin dose for patients newly starting warfarin.

The main data set for the study is available on PharmGKB.

Data Set

This data set is in an excel file and presents some challenges:

  • Column names with odd symbols
  • Data types change with different project sites

For the purpose of this tutorial I have reduced the complexity of the data set and converted it to a tab-delimited file available on my GitHub. We will use read.delim() to deal with the odd column names.

iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt")
iwpc_data %<>% tbl_df()

Checking out our Data

iwpc_data %>% 
  group_by(Project.Site) %>% 
  sample_n(1) %>% 
  datatable(rownames = FALSE, options = list(pageLength=2, scrollX = TRUE))

Tidying our Data - Column Names

iwpc_data %<>% 
  rename(subject_id = PharmGKB.Subject.ID,
         sample_id = PharmGKB.Sample.ID,
         project_site = Project.Site,
         gender = Gender,
         race_reported = Race..Reported.,
         race_omb = Race..OMB.,
         ethnicity_reported = Ethnicity..Reported.,
         ethnicitiy_omb = Ethnicity..OMB.,
         age = Age,
         height = Height..cm.,
         weight = Weight..kg.,
         indication = Indication.for.Warfarin.Treatment,
         comorbidities = Comorbidities,
         medications = Medications,
         target_inr = Target.INR,
         target_inr_estimated = Estimated.Target.INR.Range.Based.on.Indication,
         reached_stable_dose = Subject.Reached.Stable.Dose.of.Warfarin,
         therapeutic_warfarin_dose = Therapeutic.Dose.of.Warfarin,
         inr_on_warfarin = INR.on.Reported.Therapeutic.Dose.of.Warfarin,
         smoker = Current.Smoker,
         cyp2c9_consensus = CYP2C9.consensus,
         vkorc1_1639_consensus = VKORC1..1639.consensus)

Data Manipulation - Which Variables?

They Used:

  • Age in decades = 1 for 10-19, etc…
  • VKORC1 G/A = 1 if heterozygous
  • VKORC1 A/A = 1 if homozygous for A
  • VKORC1 genotype unknown = 1
  • CYP2C9 *1/*2 = 1 if *1/*2
  • CYP2C9 *1/*3 = 1 if *1/*3
  • CYP2C9 *2/*2 = 1 if homozygous *2
  • CYP2C9 *2/*3 = 1 if *2/*3
  • CYP2C9 *3/*3 = 1 if homozygous *3
  • CYP2C9 genotype unknown = 1
  • Asian Race = 1
  • Black/African American = 1
  • Missing or Mixed race = 1


  • Amiodarone status = 1
  • Enzyme inducer status = 1 if any of: rifampin, carbamazepine, phenytoin, rifampicin

We Have:

  • Age: 10-19, 20-29, 30-39 etc.
  • VKORC1: A/A, A/G, G/G
  • CYP2C9: combinations of: *1, *2, *3, *5, *6, *8, *11, etc.
  • Race: Asian, Black or African America, White, Other
  • Medications: character list of medications, lack of medications, etc.

Data Manipulation: Dummy Code Race

iwpc_data %<>% 
  mutate(asian = ifelse(str_detect(race_omb, "Asian"),
                        yes = 1,
                        no = 0),
         african_american = ifelse(str_detect(race_omb, "Black or African American"),
                                   yes = 1, 
                                   no = 0),
         missing_or_mixed_race = ifelse(str_detect(race_omb, "Unknown"),
                                        yes = 1,
                                        no = 0))
Race OMB Asian African American Missing/Mixed Race N
Asian 1 0 0 1634
Black or African American 0 1 0 462
Unknown 0 0 1 482
White 0 0 0 3122

Data Manipulation: VKORC1

iwpc_data %>% 
  mutate(vkorc1_1639_ag = ifelse(str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 1, no = 0),
         vkorc1_1639_aa = ifelse(str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 1,no = 0),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1,no = 0))
VKORC1 1639 VKORC1 A/G VKORC1 A/A VKORC1 Unknown N
A/A 0 1 0 1485
A/G 1 0 0 1470
G/G 0 0 0 1246
NA NA NA 1 1499

Warning: Running str_detect() on NA returns NA. If you don't know that data value, R doesn't know if there's a match!

Data Manipulation: VKORC1

iwpc_data %<>% 
  mutate(vkorc1_1639_ag = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 0,  no = 1),
         vkorc1_1639_aa = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 0, no = 1),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1, no = 0))
VKORC1 1639 VKORC1 A/G VKORC1 A/A VKORC1 Unknown N
A/A 0 1 0 1485
A/G 1 0 0 1470
G/G 0 0 0 1246
NA 0 0 1 1499

Data Manipulation: CYP2C9

iwpc_data %<>% 
  mutate(cyp2c9_1_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_1_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_2_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_2_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_3_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*3/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_unknown = ifelse(is.na(cyp2c9_consensus),
                                 yes = 1,no = 0))

Data Manipulation: CYP2C9

CYP2C9 1/2 1/3 2/2 2/3 3/3 Unknown N
1/1 0 0 0 0 0 0 4157
1/11 0 0 0 0 0 0 6
1/13 0 0 0 0 0 0 1
1/14 0 0 0 0 0 0 1
1/2 1 0 0 0 0 0 737
1/3 0 1 0 0 0 0 498
1/5 0 0 0 0 0 0 6
1/6 0 0 0 0 0 0 3
2/2 0 0 1 0 0 0 56
2/3 0 0 0 1 0 0 69
3/3 0 0 0 0 1 0 22
NA 0 0 0 0 0 1 144

Data Manipulation: Medications

Data Manipulation: Amiodarone

Data Manipulation: Enzyme Enducers

Data Manipulation: Age

Data Visualization: Warfarin Dose

Data Manipulation: Warfarin Dose